One sample test

We first create data. In particular we create a continuous vector:

set.seed(123)
x <- rnorm(n = 300, mean = 10, sd = 5)

Null hypothesis: the mean of x is equal to 0. We have a large sample size so we can use the t-test.

t.test(x = x, mu = 0)
## 
##  One Sample t-test
## 
## data:  x
## t = 37.258, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.634917 10.709497
## sample estimates:
## mean of x 
##  10.17221

If you check the help page you will see that mu = 0 is the default option. This means that we can omit this argument:

t.test(x = x)
## 
##  One Sample t-test
## 
## data:  x
## t = 37.258, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.634917 10.709497
## sample estimates:
## mean of x 
##  10.17221

Output interpretation: the sample mean is 10.17 and the 95% confidence interval is (9.63, 10.71). The p-value is very small, so we reject the null hypothesis that the mean is equal to 0. The test statistic is 37.26 and can be obtained by hand using the formula \(\frac{\bar{x} - \mu_0}{sd(x)/\sqrt{n}}\):

test_stat <- mean(x)/(sd(x) / sqrt(300))
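
As a sanity check, this hand-computed value should match the t statistic stored in the test object (37.258 in the output above):

test_stat
unname(t.test(x = x)$statistic) # same value, extracted from the test object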

Now let’s assume that we want to investigate whether the sample mean is equal to 10:

t.test(x = x, mu = 10)
## 
##  One Sample t-test
## 
## data:  x
## t = 0.63074, df = 299, p-value = 0.5287
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
##   9.634917 10.709497
## sample estimates:
## mean of x 
##  10.17221

In that case the p-value is too large to reject the null hypothesis. The test statistic can also be obtained as:

test_stat <- (mean(x) - 10)/(sd(x) / sqrt(300))

The p-value can also be obtained as:

2 * pt(q = test_stat, df = 300 - 1, lower.tail = FALSE)
## [1] 0.5286913
2 * (1 - pt(q = test_stat, df = 300 - 1, lower.tail = TRUE))
## [1] 0.5286913

By default, a two-sided test is performed. To perform a one-sided test, the argument alternative can be set to 'less' or 'greater':

t.test(x, mu = 10, alternative = 'less')
## 
##  One Sample t-test
## 
## data:  x
## t = 0.63074, df = 299, p-value = 0.7357
## alternative hypothesis: true mean is less than 10
## 95 percent confidence interval:
##      -Inf 10.62269
## sample estimates:
## mean of x 
##  10.17221
t.test(x, mu = 10, alternative = 'greater')
## 
##  One Sample t-test
## 
## data:  x
## t = 0.63074, df = 299, p-value = 0.2643
## alternative hypothesis: true mean is greater than 10
## 95 percent confidence interval:
##  9.721728      Inf
## sample estimates:
## mean of x 
##  10.17221
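
The one-sided p-values can also be obtained directly from the test statistic computed above for mu = 10; they should match the p-values reported by t.test():

pt(q = test_stat, df = 300 - 1, lower.tail = TRUE)  # alternative = 'less'
pt(q = test_stat, df = 300 - 1, lower.tail = FALSE) # alternative = 'greater'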

Furthermore, we can change the confidence interval level using the argument conf.level:

t.test(x, mu = 10, alternative = 'less', conf.level = 0.975)
## 
##  One Sample t-test
## 
## data:  x
## t = 0.63074, df = 299, p-value = 0.7357
## alternative hypothesis: true mean is less than 10
## 97.5 percent confidence interval:
##     -Inf 10.7095
## sample estimates:
## mean of x 
##  10.17221

What if we do not want to print the whole output? In that case we can save the test results as an object and then select the parts that we want to print:

test_res <- t.test(x, mu = 10, alternative = 'less', conf.level = 0.975)
test_res$statistic
##         t 
## 0.6307416
test_res$p.value
## [1] 0.7356544
test_res$null.value
## mean 
##   10
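
To see which components are available for extraction, we can list the names of the returned object (a list of class htest):

names(test_res) # e.g. statistic, parameter, p.value, conf.int, estimate, null.value, ...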

Let’s now assume that we only have 30 subjects (small sample size). We first create the data:

set.seed(123)
x <- rnorm(n = 30, mean = 10, sd = 5)

Null hypothesis: the median of x is equal to 0. We have a small sample size so we can use the Wilcoxon signed rank test:

wilcox.test(x = x, mu = 0)
## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 465, p-value = 1.863e-09
## alternative hypothesis: true location is not equal to 0

Note that confidence intervals are only returned if conf.int = TRUE:

wilcox.test(x = x, mu = 0, conf.int = TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 465, p-value = 1.863e-09
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
##   7.713874 11.621940
## sample estimates:
## (pseudo)median 
##       9.680038

The additional argument exact controls if exact p-values and confidence intervals are calculated or if the normal approximation is used. In the latter case, the argument correct determines if a continuity correction is applied.

wilcox.test(x = x, mu = 0, exact = TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 465, p-value = 1.863e-09
## alternative hypothesis: true location is not equal to 0
wilcox.test(x = x, mu = 0, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  x
## V = 465, p-value = 1.825e-06
## alternative hypothesis: true location is not equal to 0
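
For example, the normal approximation can also be requested without the continuity correction:

wilcox.test(x = x, mu = 0, exact = FALSE, correct = FALSE)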

Specific parts of the output can also be extracted:

test_res <- wilcox.test(x = x, mu = 0, exact = TRUE)
test_res$statistic
##   V 
## 465

The test statistics \(W_-\) and \(W_+\) can also be obtained by hand (the value V reported above corresponds to \(W_+\)):

res <- rank(abs(x - 0)) # ranks of the absolute differences |x - mu|
sum(res[(x - 0) < 0]) # W-: sum of the ranks of the negative differences
## [1] 0
sum(res[(x - 0) > 0]) # W+: sum of the ranks of the positive differences
## [1] 465
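
Since the exact test is used here, the two-sided p-value can also be recovered from the null distribution of the signed rank statistic with psignrank(), by doubling the smaller tail probability (here the upper tail of \(W_+\); this sketch assumes no ties, as in these data):

w_plus <- sum(res[(x - 0) > 0]) # W+ = 465, the value reported as V
2 * psignrank(q = w_plus - 1, n = 30, lower.tail = FALSE) # should match the p-value 1.863e-09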

Two sample test

We first create data. In particular we create two continuous vectors:

set.seed(123)
x <- rnorm(n = 300, mean = 10, sd = 5)
y <- rnorm(n = 300, mean = 11, sd = 2)

Null hypothesis: the mean of x is equal to the mean of y. Let’s assume that the samples are independent. We have a large sample size so we can use the t-test.

t.test(x = x, y = y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -2.8587, df = 400.49, p-value = 0.004475
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.427798 -0.264225
## sample estimates:
## mean of x mean of y 
##  10.17221  11.01822

It is also possible to specify the test using a formula. This is useful when we have the data in a data.frame:

dat <- data.frame(value = c(x, y), group = rep(x = c(1, 2), each = length(x)))
t.test(value ~ group, data = dat)
## 
##  Welch Two Sample t-test
## 
## data:  value by group
## t = -2.8587, df = 400.49, p-value = 0.004475
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -1.427798 -0.264225
## sample estimates:
## mean in group 1 mean in group 2 
##        10.17221        11.01822
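
The grouping variable can also be stored as a factor with more informative labels (the labels below are just illustrative); the formula call stays the same and should give identical results:

dat$group <- factor(dat$group, labels = c("x", "y"))
t.test(value ~ group, data = dat)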

By default, the test does not assume that the two samples have equal variances (this is the Welch t-test). Setting var.equal = TRUE gives the classical two sample t-test, which assumes a common variance. Check the help page for all this information!

t.test(x = x, y = y, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  x and y
## t = -2.8587, df = 598, p-value = 0.004401
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4272165 -0.2648068
## sample estimates:
## mean of x mean of y 
##  10.17221  11.01822

The F test can be used to check whether the two samples have the same variance:

var.test(x = x, y = y)
## 
##  F test to compare two variances
## 
## data:  x and y
## F = 5.7173, num df = 299, denom df = 299, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  4.555655 7.175163
## sample estimates:
## ratio of variances 
##           5.717304
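
The estimated ratio of variances can also be computed directly:

var(x) / var(y) # should equal the reported ratio of variances, about 5.72

Since the variances clearly differ here, the default Welch test used above is the appropriate choice.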

Let’s now assume that the samples are dependent. In that case we need to set the argument paired = TRUE:

t.test(x = x, y = y, paired = TRUE)
## 
##  Paired t-test
## 
## data:  x and y
## t = -2.7989, df = 299, p-value = 0.005461
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -1.4408460 -0.2511774
## sample estimates:
## mean difference 
##      -0.8460117

This is equivalent to performing a one-sample t-test of the differences x - y:

t.test(x = x - y)
## 
##  One Sample t-test
## 
## data:  x - y
## t = -2.7989, df = 299, p-value = 0.005461
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -1.4408460 -0.2511774
## sample estimates:
##  mean of x 
## -0.8460117

We can change the argument mu to test whether the mean difference is equal to a value other than zero:

t.test(x = x, y = y, mu = 10, paired = TRUE)
## 
##  Paired t-test
## 
## data:  x and y
## t = -35.883, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 10
## 95 percent confidence interval:
##  -1.4408460 -0.2511774
## sample estimates:
## mean difference 
##      -0.8460117

Let’s now assume that we only have 30 subjects (small sample size). We first create the data:

set.seed(123)
x <- rnorm(n = 30, mean = 10, sd = 5)
y <- rnorm(n = 30, mean = 11, sd = 2)

Null hypothesis: the distribution of x is equal to the distribution of y. Let’s assume that the samples are independent. We have a small sample size so we can use the Wilcoxon rank sum test:

wilcox.test(x = x, y = y, correct = TRUE, conf.int = TRUE)
## 
##  Wilcoxon rank sum exact test
## 
## data:  x and y
## W = 331, p-value = 0.07973
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -3.7349469  0.1653389
## sample estimates:
## difference in location 
##              -1.857596

Check the help page for the meaning of the correct argument. Let’s now assume that the samples are dependent. In that case we can use the Wilcoxon signed rank test:

wilcox.test(x = x, y = y, paired = TRUE)
## 
##  Wilcoxon signed rank exact test
## 
## data:  x and y
## V = 156, p-value = 0.1191
## alternative hypothesis: true location shift is not equal to 0
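
As with the paired t-test, this is equivalent to a one-sample signed rank test on the differences and should reproduce the same statistic and p-value:

wilcox.test(x = x - y, mu = 0)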

M sample test

We first create data. In particular we create three continuous vectors:

set.seed(123)
x <- rnorm(n = 300, mean = 10, sd = 5)
y <- rnorm(n = 300, mean = 11, sd = 2)
z <- rnorm(n = 300, mean = 15, sd = 7)

Null hypothesis: the means of x, y and z are identical. We have a large sample size so we can use the ANOVA F test.

dat <- data.frame(value = c(x, y, z), group = rep(x = c(1, 2, 3), each = length(x)))

boxplot(value ~ group, data = dat)

test_res <- aov(formula = value ~ group, data = dat)
summary(test_res)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## group         1   3667    3667     137 <2e-16 ***
## Residuals   898  24032      27                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
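
Note that group is numeric in dat, so aov() treats it as a continuous predictor and fits a single slope (1 degree of freedom for group in the table above). To compare the three group means as in a standard one-way ANOVA, convert the grouping variable to a factor first; the group row should then have 2 degrees of freedom:

dat$group <- factor(dat$group)
summary(aov(formula = value ~ group, data = dat))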

Let’s now assume that we only have 30 subjects (small sample size). We first create the data:

set.seed(123)
x <- rnorm(n = 30, mean = 10, sd = 5)
y <- rnorm(n = 30, mean = 11, sd = 2)
z <- rnorm(n = 30, mean = 15, sd = 7)
dat <- data.frame(value = c(x, y, z), group = rep(x = c(1, 2, 3), each = length(x)))

Null hypothesis: the distributions of x, y and z are identical. We have a small sample size so we can use the Kruskal-Wallis test, which is an extension of the Wilcoxon rank sum test to more than two groups:

# (all following options will provide the same result)
kruskal.test(x = dat$value, g = dat$group)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  dat$value and dat$group
## Kruskal-Wallis chi-squared = 17.188, df = 2, p-value = 0.0001852
kruskal.test(x = list(x, y, z))
## 
##  Kruskal-Wallis rank sum test
## 
## data:  list(x, y, z)
## Kruskal-Wallis chi-squared = 17.188, df = 2, p-value = 0.0001852
kruskal.test(formula = value ~ group, data = dat)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  value by group
## Kruskal-Wallis chi-squared = 17.188, df = 2, p-value = 0.0001852
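
As with the other tests, the result can be saved as an object and specific components extracted:

test_res <- kruskal.test(formula = value ~ group, data = dat)
test_res$statistic
test_res$p.value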

Correlation test

We first create the data:

set.seed(123)
x <- rnorm(n = 300, mean = 10, sd = 5)
y <- rnorm(n = 300, mean = 11, sd = 2)

Null hypothesis: the variables x and y are independent (no correlation). By default, the Pearson correlation is assumed.

cor.test(x = x, y = y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1.0496, df = 298, p-value = 0.2947
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17274411  0.05291402
## sample estimates:
##         cor 
## -0.06069048

Alternatively, we can obtain the test statistic and p-value by hand using the formula \(\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), where \(r\) is the sample correlation:

test_stat <- (cor(x, y) * sqrt(300 - 2))/sqrt(1 - cor(x,y)^2)
pVal <- 2 * pt(q = abs(test_stat), df = 300 - 2, lower.tail = FALSE) # two-sided p-value; abs() works for either sign of the statistic
pVal
## [1] 0.2947457

Let’s now assume that we only have 30 subjects (small sample size). We first create the data:

set.seed(123)
x <- rnorm(n = 30, mean = 10, sd = 5)
y <- rnorm(n = 30, mean = 11, sd = 2)

Null hypothesis: the variables x and y are independent (no correlation). We have a small sample size so we can use the Spearman correlation by changing the method argument:

# (with the `exact` argument we can select whether we want to perform the exact
# test or the approximate test)
cor.test(x = x, y = y, method = "spearman", exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 5030, p-value = 0.531
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1190211

Other correlation coefficients include Kendall's tau:

cor.test(x = x, y = y, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  x and y
## T = 199, p-value = 0.5239
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.08505747